A neural theory for counting memories

Keeping track of the number of times different stimuli have been experienced is a critical computation for behavior. Here, we propose a theoretical two-layer neural circuit that stores counts of stimulus occurrence frequencies. This circuit implements a data structure, called a count sketch, that is commonly used in computer science to maintain item frequencies in streaming data. Our first model implements a count sketch using Hebbian synapses and outputs stimulus-specific frequencies. Our second model uses anti-Hebbian plasticity and only tracks frequencies within four count categories (“1-2-3-many”), which trades off the number of categories that need to be distinguished against the potential ethological value of those categories. We show how both models can robustly track stimulus occurrence frequencies, thus expanding the traditional novelty-familiarity memory axis from binary to discrete, with more than two possible values. Finally, we show that an implementation of the “1-2-3-many” count sketch exists in the insect mushroom body.


Supplementary Methods
Datasets and pre-processing. The first dataset, Synthetic, consists of N = 1000 inputs with d = 50 dimensions per input, where each dimension is drawn randomly from an exponential distribution with a fixed mean (λ = 10). This distribution was selected because several types of neural stimuli, such as faces 1 and odors, 2 are encoded as an exponential distribution of firing rates over a population of neurons, with an approximately fixed mean; i.e., the inputs are encoded using a maximum entropy code. 3 The second dataset, Odors, consists of experimentally recorded responses of d = 24 olfactory receptor neurons (ORNs) in the fruit fly to N = 110 odors. 4 We fixed the mean response to each odor to be constant, to mimic the divisive normalization that occurs from ORNs to projection neurons in the antennal lobe. 2 This results in an odor representation that is concentration-independent. The third dataset, MNIST, consists of N = 10000 images of handwritten digits. Because these raw images consist largely of black pixels, the similarity between many pairs of images, regardless of their class, will be quite high. So, instead of using the raw pixel representation, we trained a LeNet5 network 5 using the 10 class labels. We then extracted a d = 84 dimensional feature representation of each image from the inner-most hidden layer of the LeNet5 network. This representation better captured the true similarity structure of digits and resulted in less count interference than the raw input.
Prior to generating the sequence of observations, we reduced each dataset so that no pair of inputs was highly correlated. Specifically, in each iteration, we selected a random input. If this input did not correlate above the threshold with any previously kept input, it was kept; otherwise, it was discarded. We iterated until no more inputs could be added. We set the maximum pairwise correlation between any two kept inputs to be no more than 0.80 (for Synthetic, Odors) or 0.70 (MNIST). This process removed no inputs from the Synthetic dataset; it reduced the Odors dataset to N = 62 odors; and it reduced the MNIST dataset to N = 180 images.
To generate the sequence of observed items, from the reduced dataset of items (X), we drew n = N random samples with replacement according to a Zipf distribution; i.e., in each sampling step, the ith item (ordered arbitrarily) was selected with probability ∝ i^{−a}, where a = 0.55.
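The following is a minimal sketch of this pre-processing and sampling pipeline for the Synthetic dataset. It assumes numpy; the greedy ordering of the correlation filter, the variable names, and the random seed are illustrative choices, not the exact procedure used for the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: N inputs, d dimensions, exponential firing rates with mean 10.
N, d = 1000, 50
X = rng.exponential(scale=10.0, size=(N, d))

def reduce_correlated(X, max_corr=0.80):
    """Greedily keep inputs whose correlation with every previously kept input is <= max_corr."""
    kept = []
    for i in rng.permutation(len(X)):
        if all(np.corrcoef(X[i], X[j])[0, 1] <= max_corr for j in kept):
            kept.append(i)
    return X[kept]

Xr = reduce_correlated(X)

# Zipf-like sequence of observations: item i (arbitrary order) chosen with prob ~ i^(-a).
a = 0.55
n = len(Xr)                                   # n = N samples, with replacement
p = np.arange(1, len(Xr) + 1, dtype=float) ** (-a)
p /= p.sum()
seq = rng.choice(len(Xr), size=n, replace=True, p=p)   # indices into the reduced dataset
```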
For the random matrix M, we used a sparse binary matrix with 6 ones per row, modeling the connectivity between projection neurons and Kenyon cells in the insect mushroom body. 6

Supplementary Note 1: Two counting schemes
The model consists of the following components and operations.
(a) The input layer, x ∈ R^d.
(b) The projection layer, y = Mx ∈ R^m. Here M is an m × d random projection matrix whose rows are drawn i.i.d. from some distribution Q. For instance, Q could be the uniform distribution over the d-dimensional unit sphere, or the uniform distribution over all vectors in {0, 1}^d with exactly c ones, for some small constant c.
We will often look at a single row of the random projection matrix, and denote it by θ ∈ R^d.
(c) After the winner-take-all operation, z ∈ {0, 1}^m. This is given by the rule

  z_j = 1 if y_j is one of the k largest entries of y, and z_j = 0 otherwise.

For ease of analysis, we will use an alternative process that produces similar behavior (Section 2.1): it chooses the largest entries of y, selecting k of them in expectation.
(d) Two counting schemes that store information in a vector w ∈ R m .
• Neural count sketch (a Hebbian learning rule):
  – Initialize w_1 = · · · = w_m = 0.
  – When item x is observed, with tag z: set w = w + z.
  At any given time, the frequency estimate of an item x, with tag z, is given by ⟨w, z⟩/k.

• 1-2-3-many sketch (an anti-Hebbian learning rule):
  – Initialize w_1 = · · · = w_m = 1.
  – When item x is observed, with tag z: set w_j = w_j e^{−β z_j} for some constant β > 0.
  At any given time, the 1-2-3-many frequency estimate of an item x, with tag z, is given by ⟨w, z⟩/k. This value is roughly 1 if the item has not been seen before (i.e., it is being seen for the first time), e^{−β} if it has been seen once before, e^{−2β} if it has been seen twice, and at most e^{−3β} if it has been seen more often.
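The sketch below implements both update rules with a sparse binary projection (6 ones per row, as in the Supplementary Methods) and an exact top-k winner-take-all. All parameter values are illustrative, and the code is a minimal demonstration rather than the simulation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, k, beta = 50, 2000, 32, 1.0          # illustrative sizes, not the paper's exact settings

# Sparse binary random projection M: each row has exactly 6 ones (PN -> KC style).
M = np.zeros((m, d))
for row in M:
    row[rng.choice(d, size=6, replace=False)] = 1.0

def tag(x):
    """Winner-take-all tag: z has ones at the k largest entries of y = Mx."""
    z = np.zeros(m)
    z[np.argsort(M @ x)[-k:]] = 1.0
    return z

def observe(x, w_hebb, w_anti):
    """One observation updates both memories: Hebbian (add z) and anti-Hebbian (multiplicative decay)."""
    z = tag(x)
    return w_hebb + z, w_anti * np.exp(-beta * z)

w_hebb, w_anti = np.zeros(m), np.ones(m)   # count-sketch / 1-2-3-many initial states

x1, x2 = rng.exponential(10.0, size=d), rng.exponential(10.0, size=d)
for _ in range(3):
    w_hebb, w_anti = observe(x1, w_hebb, w_anti)   # x1 seen three times
w_hebb, w_anti = observe(x2, w_hebb, w_anti)       # x2 seen once

z1 = tag(x1)
print(z1 @ w_hebb / k)   # count-sketch estimate for x1: ~3, plus small interference
print(z1 @ w_anti / k)   # 1-2-3-many response for x1: ~exp(-3 * beta)
```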

An alternative formulation of winner-takes-all
The winner-take-all operation produces z ∈ {0, 1}^m where

  z_j = 1 if y_j is one of the k largest entries of y, and z_j = 0 otherwise.

To facilitate the analysis, we will work with an alternative formulation that produces similar results:

  z_j = 1 if y_j ≥ τ_x, and z_j = 0 otherwise.

Here τ_x is a threshold that is allowed to depend on x. Notice that the rule (y_j ≥ τ_x) indeed picks out the largest entries of y, but might choose more or fewer than k elements, depending on how τ_x is set. We will specify τ_x so that, in expectation, k entries of y are picked out:

• For a given x and for any fraction 0 < f < 1, define τ_x(f) to be the top f-fractile value of the distribution of θ · x, where θ ∼ Q, as follows:

  τ_x(f) = sup{ t : Pr_{θ∼Q}(θ · x ≥ t) ≥ f }.

For instance, τ_x(1/2) is a median value of θ · x.
• Set τ_x = τ_x(k/m), so that

  Pr_{θ∼Q}(θ · x ≥ τ_x) ≈ k/m,

where the approximation arises from possible discretization issues. For convenience, we will henceforth assume that this is an exact equality:

  Pr_{θ∼Q}(θ · x ≥ τ_x) = k/m.

Thus, for any x, the expected number of ones in z is exactly k. Although this τ_x in general depends on x, the scenarios we study have enough symmetry that these thresholds turn out to be the same for all x.
Notice also that this rule is scale-invariant: scaling the x's by a multiplicative constant will simply lead to τ_x being scaled by the same constant.
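The following is a minimal sketch of this thresholded variant, taking Q to be the uniform distribution over binary vectors with c ones and estimating τ_x(k/m) empirically from a fresh sample of rows θ ∼ Q. All parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, k, c = 50, 2000, 32, 6

def sample_Q(rows):
    """Rows drawn from Q: binary vectors in {0,1}^d with exactly c ones."""
    T = np.zeros((rows, d))
    for row in T:
        row[rng.choice(d, size=c, replace=False)] = 1.0
    return T

M = sample_Q(m)            # the projection matrix itself
thetas = sample_Q(5000)    # a separate sample used only to estimate the fractile tau_x

def tag_threshold(x):
    """Fire unit j iff (Mx)_j >= tau_x, where tau_x is the top (k/m)-fractile of
    theta . x under Q, so that k units fire in expectation."""
    tau_x = np.quantile(thetas @ x, 1.0 - k / m)
    return (M @ x >= tau_x).astype(float)

x = rng.exponential(10.0, size=d)
print(tag_threshold(x).sum())                                   # close to k on average
print(np.array_equal(tag_threshold(x), tag_threshold(2 * x)))   # True: the rule is scale-invariant
```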

Supplementary Note 2: Theoretical analysis
The method of tag generation effectively induces a similarity function on the input space, s : X × X → [0, 1]. It can be shown that dot products in tag space are, in expectation, proportional to similarity values in X-space. Our analysis makes heavy use of this connection.

The underlying similarity function
For any two inputs x, x′ ∈ R^d, define

  s(x, x′) = (m/k) · Pr_{θ∼Q}(θ · x ≥ τ_x and θ · x′ ≥ τ_{x′}).

To interpret this, let z, z′ ∈ {0, 1}^m be the projected-and-thresholded versions of x, x′, respectively. Then s(x, x′) is proportional to the probability that both z_j and z′_j are set to 1, for any specific coordinate 1 ≤ j ≤ m; it is scaled by m/k to map it onto the range [0, 1]. The largest possible value of s(x, x′) is 1, when x′ = x. If x′ is far from x, in some suitable sense, then s(x, x′) will be much smaller. Thus s(x, x′) can be thought of as a measure of similarity between x and x′. It is a standard type of similarity function called a kernel function, which is another way of saying that it is positive semidefinite.
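This definition can be estimated directly by Monte Carlo sampling of θ ∼ Q. The sketch below assumes Q = N(0, I_d) and unit-norm inputs, so that the threshold is the same for all x; the parameter values and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, k = 50, 2000, 32

def estimate_similarity(x, xp, samples=100000):
    """Monte Carlo estimate of s(x, x') = (m/k) Pr(theta.x >= tau and theta.x' >= tau),
    with Q = N(0, I_d) and unit-norm inputs (so tau is shared by all inputs)."""
    thetas = rng.standard_normal((samples, d))
    px, pxp = thetas @ x, thetas @ xp
    tau = np.quantile(px, 1.0 - k / m)          # empirical top (k/m)-fractile
    return (m / k) * np.mean((px >= tau) & (pxp >= tau))

x = rng.standard_normal(d);  x /= np.linalg.norm(x)
xp = rng.standard_normal(d); xp /= np.linalg.norm(xp)
print(estimate_similarity(x, x))    # ~ 1 when x' = x
print(estimate_similarity(x, xp))   # ~ k/m for near-orthogonal random inputs
```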
Lemma 1 s : X × X → R is a kernel function.

Similarity function: examples
In earlier work, 7,8 we have derived the form of the similarity function in certain settings.
• Binary inputs and projection matrix. Here the inputs are binary vectors with b ones; that is, they lie in the set

  { x ∈ {0, 1}^d : x_1 + · · · + x_d = b },

and the random projection matrix is binary, where Q is the uniform distribution over vectors in {0, 1}^d that contain exactly c ones, for some integer c. The thresholds τ_x are all set to c; since we need Pr_θ(θ · x ≥ τ_x) = k/m, this means that

  k/m = (b choose c) / (d choose c).

Then the similarity function has the form

  s(x, x′) = (x · x′ choose c) / (b choose c),

where x · x′ is the number of ones common to x and x′.

• Gaussian projection matrix. In this case, the inputs are unit vectors and each entry of the projection matrix is picked from a standard normal distribution; thus Q is N(0, I_d). It can be shown that

  s(x, x′) = (m/k) · Pr(Z ≥ τ and Z′ ≥ τ),

where (Z, Z′) are standard normal random variables with correlation x · x′ and τ is the common threshold τ_x(k/m); in particular, s(x, x′) is an increasing function of the dot product x · x′.
In both these cases, it can also be seen that when x, x′ are chosen independently, s(x, x′) ≈ k/m. It is useful to think of k/m as the similarity value between random independent inputs.

Dot products in tag-space
In Section 4.2, we show that for any two inputs x, x′, their tags Z, Z′ (which are random variables because they depend on the random projection matrix M) satisfy

  E⟨Z, Z′⟩ = k · s(x, x′).

Moreover, the actual value of ⟨Z, Z′⟩ is tightly concentrated around this expectation. The neural count sketch is based exclusively on dot products in z-space, and thus this connection is essential to its analysis. For the 1-2-3-many sketch, we use a different approach.

The neural count sketch
Suppose that the neural count sketch receives a sequence of n observations in X . What is its frequency response on a subsequent input x?
The notion of a frequency estimate makes the most sense in discrete spaces, whereas in our setting, the input space may well be continuous. One approach to dealing with this is to suitably discretize the space, for instance by focusing on situations in which distinct observations are well-separated from one another. We follow this line of reasoning in Section 3.4.1. In Section 3.4.3, we show that for continuous input spaces the response of the neural count sketch can be seen as a kernel density estimate.

Frequency estimation
Suppose that so far only N distinct observations have appeared, x^(1), . . . , x^(N), and that x^(i) has been seen f_i times. Thus the total number of observations, counting duplicates, is n = f_1 + f_2 + · · · + f_N. Given a new input x, we would like the frequency estimate for it to be f_ℓ if x = x^(ℓ) for some ℓ, and 0 if x is different from all the x^(i).
A basic difficulty is that the space X might be continuous and some of the x^(i) might be close together. To minimize interference effects, we assume this cannot happen:

Assumption 1 There is some 0 ≤ ξ < 1 such that s(x^(i), x^(j)) ≤ ξ for all i ≠ j.
If the distinct x^(i) were chosen independently, we'd expect to have s(x^(i), x^(j)) ≈ k/m, as discussed above. Thus the case ξ = k/m is of particular interest.
Theorem 2 Pick any 0 < δ < 1 and define σ = (1/(3k)) ln(3/δ). Suppose that the neural count sketch witnesses n observations satisfying Assumption 1. Given a subsequent input x:

(a) The expected response on x (over the choice of random matrix M) is Σ_{ℓ=1}^{N} f_ℓ · s(x, x^(ℓ)).

(b) If x = x^(ℓ) for some ℓ, then with probability at least 1 − δ, the response is at least f_ℓ(1 − σ) and at most

  f_ℓ(1 + σ) + ξn + (n/(3k)) ln(3/δ) + √((2n/k) · nξ · ln(3/δ)).

(c) If x is far from all the x^(ℓ), in the sense that s(x, x^(ℓ)) ≤ ξ for all 1 ≤ ℓ ≤ N, then the expected response lies in the interval [0, ξn]. With probability at least 1 − δ, the response is at most

  ξn + (n/(3k)) ln(3/δ) + √((2n/k) · nξ · ln(3/δ)).

To interpret this result, suppose that ξ = k/m; as explained above, this is what we would expect if distinct inputs were picked independently. Suppose also that we would like frequency estimates to be accurate within ±1. For this to hold in expectation, it would be enough to have m ≥ kn. For it to hold with probability ≥ 1 − δ, for frequencies in the range [0, f], it would be sufficient to take m ≥ 2kn and k = Ω(max(n, f²) ln(1/δ)).
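As a concrete illustration of these parameter settings (the numbers below are arbitrary examples, not values used in the paper), the expected interference term ξn stays below 1 once m ≥ 2kn:

```python
# Illustrative parameter check (arbitrary example numbers).
k, n = 200, 100
m = 2 * k * n            # take m >= 2kn, so xi = k/m <= 1/(2n)
xi = k / m
print(m, xi, xi * n)     # interference xi*n <= 0.5: estimates accurate within +-1 in expectation
```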
In the literature on streaming algorithms, it is common to consider much looser frequency estimates that are accurate within ±εn for some constant ε > 0. The neural count sketch would achieve this with probability 1 − δ if m ≥ 2k/ε and k = Ω((1/ε²) ln(1/δ)).

Bounded precision
Recall that in the count sketch, the weights w_j start off at zero and change only when incremented. We now define a notion of what it might mean for such weights to have "b bits of precision," with saturation value B ≥ 2^b:

• If w_j < 2^b, then the operation w_j + 1 produces the correct value.
• If w_j ≥ 2^b, then the operation w_j + 1 results in some value in the range [w_j, B].

In short, the weights behave linearly in the range [0, 2^b] and then saturate at value B.
Under these conditions, the guarantees of Theorem 2 would continue to hold for queries x with frequency ≤ O(2^b), while higher frequencies would get squashed between 2^b and B.
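A minimal sketch of such a saturating counter; the behavior above the 2^b threshold (simply holding the current value) is one concrete, hypothetical choice within the range allowed by the definition above.

```python
def saturating_increment(w_j, b, B):
    """Increment a weight that has b bits of precision: exact while w_j < 2**b;
    above that, the result may be anything in [w_j, B] (here we simply keep w_j)."""
    if w_j < 2 ** b:
        return w_j + 1
    return min(w_j, B)      # one concrete choice within the allowed range [w_j, B]

w = 0
for _ in range(20):
    w = saturating_increment(w, b=4, B=32)
print(w)                    # counts exactly up to 2**4 = 16, then saturates
```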

The neural count sketch as a density estimator
The analysis of the previous section applies to a situation where the observations are essentially discrete: any two observations are either duplicates or are far from each other. This is the setting for work on frequency estimation in the streaming algorithms literature.
When the observation space is continuous, it might be more helpful to view the response of the neural count sketch as a kernel density estimate. To see how this comes about, suppose that after seeing observations x^(1), . . . , x^(n) ∈ X, we define

  f_n(x) = ⟨w, z⟩ / (kn),

where z is the tag of x. From Lemma 11, we see that this has expectation (over the choice of matrix M)

  E f_n(x) = (1/n) Σ_{i=1}^{n} s(x, x^(i)),

and moreover is tightly concentrated around this value. Since each s(x, x^(i)), seen as a function of x, is continuous, takes values in [0, 1], and is maximized at x = x^(i), we can think of it as an unnormalized probability density centered at x^(i). Thus f_n is very much like a kernel density estimate, as long as the normalizing constants for the different s(·, x^(i)) are equal.
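The sketch below illustrates this density-estimate view numerically, assuming a Gaussian projection matrix and unit-norm inputs, with tags computed by the exact top-k rule; the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, k, n = 20, 4000, 40, 20

M = rng.standard_normal((m, d))                    # Gaussian Q

def tag(x):
    z = np.zeros(m)
    z[np.argsort(M @ x)[-k:]] = 1.0
    return z

obs = rng.standard_normal((n, d))
obs /= np.linalg.norm(obs, axis=1, keepdims=True)  # unit-norm observations
w = sum(tag(x) for x in obs)                       # Hebbian count-sketch memory

def f_n(x):
    """Normalized response <w, z>/(k n); behaves like a kernel density estimate."""
    return tag(x) @ w / (k * n)

query = rng.standard_normal(d)
query /= np.linalg.norm(query)
print(f_n(obs[0]))    # relatively large: the estimate peaks near observed points
print(f_n(query))     # small (~ k/m) for an unrelated query
```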

Definition 4 We say that a similarity function s : X × X → [0, 1] has uniform normalization if there is a constant C > 0 such that for all x_o ∈ X,

  ∫_X s(x, x_o) dλ(x) = C,

where λ denotes Lebesgue measure.
The two examples of similarity functions in Section 3.2 both have this property.

The 1-2-3-many sketch
We now analyze the performance of the 1-2-3-many sketch as a frequency estimator. For an input x, we would hope that the response of the sketch, ⟨W, Z⟩/k, is roughly 1 if x has not been seen before, e^{−β} if it has been seen once, e^{−2β} if it has been seen twice, and so on. Equivalently, we can think of the frequency estimate on x as −(1/β) ln(⟨W, Z⟩/k), rounded to the nearest integer.
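A minimal readout sketch under this interpretation; the value of β and the cutoff at "many" are illustrative choices, not quantities prescribed by the analysis.

```python
import numpy as np

def one_two_three_many(response, beta):
    """Map the 1-2-3-many response <W, Z>/k (~ exp(-beta * count)) back to a count
    estimate; counts above 3 are reported as "many" (an illustrative cutoff)."""
    est = int(round(-np.log(response) / beta))
    return est if est <= 3 else "many"

beta = 1.0
for f in range(6):
    print(f, one_two_three_many(np.exp(-beta * f), beta))
# counts 0-3 are recovered exactly; 4 and 5 both read out as "many"
```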

Frequency estimation
We study the 1-2-3-many sketch under the same conditions as the neural count sketch (Assumption 1). Recall that a total of n observations are seen, consisting of arbitrarily-interleaved repetitions of N distinct items.
This estimate is significantly more accurate than the count sketch's for comparable settings of k and m. First, recall that in order to get good behavior of the count sketch, we needed to take ξ ≤ 1/(2n) (or equivalently, m ≥ 2kn), where n is the total number of observations, including repeats. Here we only need ξ ≤ β/(2N) (equivalently, m ≥ (2/β)kN), where N is the number of distinct observations. Second, for the count sketch, we needed k to be on the order of n to get estimates accurate within ±1, even for small frequencies. In contrast, here we only need k = O(1).
To understand the shift in dependence from n to N, consider a situation in which there are relatively few distinct observations x^(1), . . . , x^(N), but many repetitions of each. Even if these observations are chosen at random, their tags may still overlap to some extent. Nevertheless, if ξ ≤ c/N for some constant 0 < c < 1, then it is very likely that for any given x^(ℓ), a constant fraction of the 1's in its tag will not overlap those of the other N − 1 tags. This pristine part of x^(ℓ)'s tag is what enables the 1-2-3-many sketch to work well, no matter how many times the other observations are repeated. But in the count sketch, the integration of coordinates of Z is done differently and, as a result, the noise in the non-pristine part of the tag eventually dominates the frequency estimate.

Bounded precision
Recall that in the 1-2-3-many sketch, all weights w_j start at 1 and are subsequently changed only by multiplication by e^{−β}. Thus they are monotonically decreasing with time and remain in the range [0, 1]. We now define a model of bounded precision for such weights, analogous to the one for the count sketch:

• If w_j ≥ 2^{−b}, then the operation w_j e^{−β} produces the correct value.
• If w_j < 2^{−b}, then the operation w_j e^{−β} results in some value in the range [w_j e^{−β}, w_j].

Under these conditions, the results of Theorem 5 continue to hold for queries with frequency ≤ (b/β) ln 2. For larger frequencies, the response will be some number greater than or equal to this. Thus if b is at least (roughly) 4β, the 1-2-3-many sketch lives up to its name.

The 1-2-3-many sketch can be seen as a generalization of a Bloom filter, which is in essence a "1-many" sketch: it keeps track of which items are being seen for the first time. The Bloom filter is also based on hashing and has roughly the same complexity, in the sense that it uses k = O(log(1/δ)) hash functions and a total table size of m = O(kN). It is known that the Bloom filter's resource requirements are optimal within constant factors, and thus the same must hold for the 1-2-3-many sketch.

Supplementary Note 3: Proof details

Proof of Lemma 1
First we observe that the function is symmetric by definition: s(x, x′) = s(x′, x). Next, pick any finite set of points x^(1), . . . , x^(n) ∈ X and a_1, . . . , a_n ∈ R. We need to show that

  Σ_{i,j} a_i a_j s(x^(i), x^(j)) ≥ 0.

To this end, pick θ ∼ Q and define the random variable U_i to be 1 if θ · x^(i) ≥ τ_{x^(i)}, and 0 otherwise. Then

  Σ_{i,j} a_i a_j s(x^(i), x^(j)) = (m/k) Σ_{i,j} a_i a_j E[U_i U_j] = (m/k) E[(Σ_i a_i U_i)²] ≥ 0.
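The same argument can be checked numerically: a Monte Carlo estimate of the Gram matrix of s is an average of outer products of the indicator vectors U, and is therefore positive semidefinite for exactly the reason used in the proof. The sketch below assumes Q = N(0, I_d) and unit-norm inputs; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d, m, k, npts, samples = 20, 4000, 40, 8, 100000

thetas = rng.standard_normal((samples, d))                  # theta ~ Q = N(0, I_d)
tau = np.quantile(rng.standard_normal(samples), 1 - k / m)  # top (k/m)-fractile of N(0,1)

X = rng.standard_normal((npts, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)               # unit-norm inputs

U = (thetas @ X.T >= tau).astype(float)     # samples x npts matrix of indicators U_i
S = (m / k) * (U.T @ U) / samples           # Monte Carlo Gram matrix: S_ij ~ s(x_i, x_j)

a = rng.standard_normal(npts)
print(a @ S @ a)                            # the quadratic form is nonnegative
print(np.linalg.eigvalsh(S).min())          # smallest eigenvalue is >= 0 (up to rounding)
```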

Dot products in z-space
In what follows, let z(x, M) ∈ {0, 1}^m denote the tag for input x if random projection matrix M is used.
Lemma 7 (Expected dot products in z-space) Fix any x, x′ ∈ R^d. Then,

  E_M ⟨z(x, M), z(x′, M)⟩ = k · s(x, x′).

Proof: For fixed x, x′, we have

  E_M ⟨z(x, M), z(x′, M)⟩ = Σ_{j=1}^{m} E_M[z(x, M)_j · z(x′, M)_j] = Σ_{j=1}^{m} Pr_{θ_j∼Q}(θ_j · x ≥ τ_x and θ_j · x′ ≥ τ_{x′}) = m · (k/m) · s(x, x′) = k · s(x, x′),

where we have used the fact that the values of z(x, M)_j and z(x′, M)_j depend only on the jth row of M, which is drawn from distribution Q.
Next, we show that these dot products between tags are tightly concentrated around their expected values. We use a form of Bernstein's inequality 9 from Theorem 2.10 of Boucheron et al. 10 that we reproduce here.
Lemma 8 (Bernstein's large deviation inequality) Suppose X_1, . . . , X_n are independent and identically distributed random variables that satisfy X_i ≤ E X_i + c for some constant c > 0. Then for any 0 < δ < 1, the following holds with probability at least 1 − δ:

  X_1 + · · · + X_n ≤ E[X_1 + · · · + X_n] + √(2v ln(1/δ)) + (c/3) ln(1/δ), where v = Σ_i E[X_i²].

We start with a concentration result that applies to a fixed pair of inputs x, x′.
Lemma 9 (Concentration of dot products in z-space) Pick any x, x′ ∈ R^d and any 0 < δ < 1. Then with probability at least 1 − δ,

  ⟨z(x, M), z(x′, M)⟩ ≤ k · s(x, x′) + √(2k s(x, x′) ln(1/δ)) + (1/3) ln(1/δ).

Proof: For 1 ≤ j ≤ m, define U_j = z(x, M)_j · z(x′, M)_j. Then ⟨z(x, M), z(x′, M)⟩ = U_1 + · · · + U_m, and by Lemma 7 this has expected value k · s(x, x′). The bound follows from observing that the U_j are independent (they depend on different rows of matrix M, chosen independently) and applying the Bernstein bound of Lemma 8.
Next, we give a concentration result that holds uniformly over the entire space X .
Lemma 10 (Uniform concentration of dot-products in z-space) There is an absolute constant C > 0 for which the following holds. Pick any 0 < δ < 1. Then with probability at least 1 − δ over the choice of M, the dot products ⟨z(x, M), z(x′, M)⟩ are close to their expectations k · s(x, x′) simultaneously for all pairs x, x′ ∈ X.

Proof: For any pair x, x′ ∈ X, define f_{x,x′}(θ) = 1(θ · x ≥ τ_x and θ · x′ ≥ τ_{x′}). The set on which f_{x,x′} takes value 1 is the intersection of two halfspaces, and thus the class of all these functions, F = {f_{x,x′} : x, x′ ∈ X}, has VC dimension O(d). We can then apply standard relative VC bounds (Theorem 5.1 of Boucheron et al. 11) to get the following:

• Pick θ_1, . . . , θ_m ∼ Q independently.
• Then with probability at least 1 − δ, every f ∈ F satisfies |E f − (1/m) Σ_{j=1}^{m} f(θ_j)| ≤ √(E f · γ_m) + γ_m, where γ_m = 4(2 VC(F) ln 2m + ln(8/δ))/m.

In our setting, taking M to be the matrix with rows θ_1, . . . , θ_m, we have (1/m) Σ_{j=1}^{m} f_{x,x′}(θ_j) = (1/m) ⟨z(x, M), z(x′, M)⟩, and by Lemma 7, E f_{x,x′} = (k/m) s(x, x′). The lemma statement now follows immediately.
Finally, we give a specialized concentration bound that applies to the inner product ⟨w, z⟩ where w is the sum of the tags of n observations.

Lemma 11 (Concentration of frequency response) Pick any x^(1), . . . , x^(n), x ∈ X and any 0 < δ < 1. Suppose matrix M is picked by sampling its rows i.i.d. from distribution Q. Let Z^(i) be a shorthand for the random variable z(x^(i), M) and Z for z(x, M). Let W = Z^(1) + · · · + Z^(n). Then:

(a) E⟨W, Z⟩ = k Σ_{i=1}^{n} s(x, x^(i)).

(b) With probability at least 1 − δ, ⟨W, Z⟩ ≤ E⟨W, Z⟩ + √(2nk Σ_{i=1}^{n} s(x, x^(i)) ln(1/δ)) + (n/3) ln(1/δ).

Proof: Part (a) follows immediately from Lemma 7. For part (b), start by defining U_j = (Z^(1)_j + · · · + Z^(n)_j) Z_j for 1 ≤ j ≤ m. The U_1, . . . , U_m are independent and identically distributed, with range 0 ≤ U_j ≤ n, expected value

  E U_j = (k/m) Σ_{i=1}^{n} s(x, x^(i)),

and expected squared value

  E[U_j²] ≤ n · E U_j = (nk/m) Σ_{i=1}^{n} s(x, x^(i)).

Now, ⟨W, Z⟩ = U_1 + · · · + U_m, and the result follows by applying a Bernstein large deviation bound (Lemma 8) to this sum.

Proof of Theorem 2
Let Z^(i) be a shorthand for z(x^(i), M), the tag assigned to observation x^(i) with random projection matrix M. After seeing the n observations, the neural count sketch has weight vector

  W = Z^(1) + · · · + Z^(n).

Let Z be short for z(x, M), the tag assigned to x. For part (a), we have from Lemma 11(a) that the expected response of the count sketch on input x is

  (1/k) E⟨W, Z⟩ = Σ_{i=1}^{n} s(x, x^(i)) = Σ_{ℓ=1}^{N} f_ℓ s(x, x^(ℓ)).

For part (b), we take x = x^(ℓ) and divide the response into two parts:

  (1/k) ⟨W, Z⟩ = (f_ℓ/k) Σ_{j=1}^{m} Z_j + (1/k) Σ_{j=1}^{m} (Σ_{i: x^(i) ≠ x^(ℓ)} Z^(i)_j) Z_j.   (4)

Now Σ_j Z_j is the sum of m independent 0−1 random variables that each have expected value k/m. By a multiplicative Chernoff bound, with probability at least 1 − 2δ/3,

  (1 − σ) k ≤ Σ_{j=1}^{m} Z_j ≤ (1 + σ) k.

The remaining term of (4) is nonnegative and by Lemma 7 has expected value

  Σ_{i: x^(i) ≠ x^(ℓ)} s(x^(ℓ), x^(i)) ≤ ξ (n − f_ℓ) ≤ ξn,

where the last step invokes Assumption 1. By Lemma 11(b), with probability at least 1 − δ/3, this term exceeds its expected value by at most

  (n/(3k)) ln(3/δ) + √((2n/k) · nξ · ln(3/δ)).

For part (c), we follow the same reasoning as (b), but focus only on the second term in (4).

Proof of Theorem 5
In this proof, we'll deal with the more general case where ξ ≤ c/(N −1), for some constant 0 < c < 1, and then specialize later.
Let Z^(i) = z(x^(i), M) be a shorthand for the tag of x^(i) under random projection matrix M. It follows from the update rule of the 1-2-3-many sketch that after seeing n observations satisfying Assumption 1, the weight vector W has coordinates

  W_j = exp(−β Σ_{i=1}^{N} f_i Z^(i)_j),   for 1 ≤ j ≤ m.

Let Z = z(x, M) be the tag of x. The response on x is then ⟨W, Z⟩/k.
Let's analyze this response when x = x^(ℓ), so that Z = Z^(ℓ). For any 1 ≤ j ≤ m with Z_j = 1, we have first of all that

  W_j = exp(−β f_ℓ − β Σ_{i ≠ ℓ} f_i Z^(i)_j) ≤ e^{−β f_ℓ}.

To upper bound the sum on the left, we look at how many of the Z^(i)_j, i ≠ ℓ, are set to one. Conditioned on Z_j = 1, each Z^(i)_j with i ≠ ℓ equals 1 with probability s(x^(ℓ), x^(i)) ≤ ξ, so the expected value of Σ_{i ≠ ℓ} Z^(i)_j is at most ξ(N − 1) ≤ c.
By Markov's inequality, with probability at least 1 − c, this summation is < 1 and hence (since it is integral) zero. If this is true, then we also have Σ_{i ≠ ℓ} f_i Z^(i)_j = 0. Hence,

  W_j = e^{−β f_ℓ}.

Thus whenever Z_j = 1, we have W_j ≤ e^{−β f_ℓ} and Pr(W_j = e^{−β f_ℓ} | Z_j = 1) ≥ 1 − c.
We can now show part (a) of the theorem statement. Take any 0 < δ < 1 and 0 < ε ≤ 1/2. We can apply Chernoff bounds to show that with probability at least 1 − δ,

  Z_1 + · · · + Z_m ≤ (1 + ε) k.   (E_1)

If ε and c are both ≤ β/2, then this value is either f_ℓ or f_ℓ + 1 when rounded to the nearest integer.